Build a strong intuition for how hierarchical clustering and k-means clustering work and how to interpret their results
machine-learning
cluster-analysis
r
Author
Victor Omondi
Published
October 19, 2020
Overview
Cluster analysis is a powerful toolkit in the data science workbench. It is used to find groups of observations (clusters) that share similar characteristics. These similarities can inform all kinds of business decisions; for example, in marketing, it is used to identify distinct groups of customers for which advertisements can be tailored. We will explore two commonly used clustering methods - hierarchical clustering and k-means clustering. We’ll build a strong intuition for how they work and how to interpret their results. We’ll develop this intuition by exploring three different datasets: soccer player positions, wholesale customer spending data, and longitudinal occupational wage data.
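The original setup chunk is not shown in this post. Based on the functions used throughout, a minimal setup sketch would look something like the following (the package list is an assumption inferred from the calls below, e.g. dummy.data.frame() from dummies, color_branches() from dendextend, pam() and silhouette() from cluster):

# Packages assumed by the code in this post
library(tidyverse)   # ggplot2, dplyr, tidyr, tibble, purrr
library(dummies)     # dummy.data.frame() for one-hot encoding categorical features
library(cluster)     # pam() and silhouette()
library(dendextend)  # color_branches() for coloring dendrograms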
Calculating distance between observations
Cluster analysis seeks to find groups of observations that are similar to one another while being different from the other groups. This similarity/difference is captured by a metric called distance. We will calculate the distance between observations for both continuous and categorical features, and develop an intuition for how the scales of features can affect distance.
What is cluster analysis?
A form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).
When to cluster?
Identifying distinct groups of stocks that follow similar trading patterns.
Using consumer behavior data to identify distinct segments within a market.
# Plot the positions of the players
ggplot(two_players, aes(x = X, y = Y)) +
  geom_point() +
  # Assuming a 40x60 field
  lims(x = c(-30, 30), y = c(-20, 20))
# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt((player1$X - player2$X)^2 + (player1$Y - player2$Y)^2)
player_distance
15
The dist() function makes life easier when working with many dimensions and observations.
dist(three_players)
1 2
2 15.00000
3 19.10497 13.03840
three_players
   X   Y
1  0   0
2  9  12
3 -2  19
The importance of scale
When a variable is on a larger scale than the other variables in the data, it may disproportionately influence the distance calculated between observations.
The scale() function, by default, centers and scales the column features.
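As a quick illustration (using a made-up trade_data data frame that is not part of the original exercises), compare the distances before and after scaling:

# Hypothetical data frame with features on very different scales
trade_data <- data.frame(
  price  = c(20, 25, 21),              # tens
  volume = c(100000, 500000, 150000)   # hundreds of thousands
)

# Distances on the raw data are dominated by the volume column
dist(trade_data)

# scale() centers each column to mean 0 and scales it to standard deviation 1,
# so both features contribute comparably to the distance
dist(scale(trade_data))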
# Dummify the Survey Data (dummy.data.frame() comes from the dummies package)
dummy_survey <- dummy.data.frame(job_survey)

# Calculate the Distance
dist_survey <- dist(dummy_survey, method = "binary")

# Print the Distance Matrix
dist_survey
This distance metric successfully captures that observations 1 and 2 are identical (a distance of 0).
Hierarchical clustering
How do you find groups of similar observations (clusters) in data using the calculated distances? We will explore the fundamental principles of hierarchical clustering - the linkage criteria and the dendrogram plot - and how both are used to build clusters. We will also explore data from a wholesale distributor in order to perform market segmentation of clients using their spending habits.
Comparing more than two observations
Hierarchical clustering
Complete Linkage: the maximum distance between members of the two clusters
Single Linkage: the minimum distance between members of the two clusters
Average Linkage: the average distance between members of the two clusters (a small sketch comparing the three criteria follows this list)
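For intuition, here is a small hedged sketch (with made-up cluster_a and cluster_b matrices) showing how the three criteria summarize the same set of pairwise distances:

# Two hypothetical clusters of (x, y) positions
cluster_a <- matrix(c(0, 0,
                      1, 1), ncol = 2, byrow = TRUE)
cluster_b <- matrix(c(5, 5,
                      6, 7), ncol = 2, byrow = TRUE)

# All pairwise distances between members of cluster_a and members of cluster_b
pair_dist <- as.matrix(dist(rbind(cluster_a, cluster_b)))[1:2, 3:4]

max(pair_dist)   # complete linkage
min(pair_dist)   # single linkage
mean(pair_dist)  # average linkage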
Capturing K clusters
Hierarchical clustering in R
hclust() function to calculate the iterative linkage steps
cutree() function to extract the cluster assignments for the desired number (k) of clusters.
The players_clustered data frame contains the x & y positions of 12 players at the start of a 6v6 soccer match, to which we have added cluster assignments based on the following parameters (a sketch of how this data frame could be built appears after the list):
Distance: Euclidean
Number of Clusters (k): 2
Linkage Method: Complete
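A hedged sketch of how players_clustered could have been built with those parameters, assuming a players data frame of positions:

# Euclidean distance between players
dist_players <- dist(players, method = "euclidean")

# Complete linkage hierarchical clustering
hc_players <- hclust(dist_players, method = "complete")

# Extract k = 2 cluster assignments and attach them to the data
players_clustered <- mutate(players, cluster = cutree(hc_players, k = 2))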
Exploring the clusters
# Count the cluster assignments
count(players_clustered, cluster)
cluster   n
1         6
2         6
Visualizing K Clusters
Because clustering analysis is always in part qualitative, it is incredibly important to have the necessary tools to explore the results of the clustering.
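A minimal sketch of such a visualization for the two player clusters, assuming the players_clustered data frame built above:

# Plot player positions, colored by cluster assignment
# (column names assumed to be x and y; adjust to X/Y if your data uses uppercase)
ggplot(players_clustered, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()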
# Prepare the Distance Matrix
dist_players <- dist(players)

# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method = "complete")
hc_single   <- hclust(dist_players, method = "single")
hc_average  <- hclust(dist_players, method = "average")

# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1, 3))
plot(hc_complete, main = 'Complete Linkage')
plot(hc_single, main = 'Single Linkage')
plot(hc_average, main = 'Average Linkage')
Height of the tree
An advantage of working with a clustering method like hierarchical clustering is that you can describe the relationships between your observations based on both the distance metric and the linkage method selected (the combination of which defines the height of the tree).
dist_players <- dist(players, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)
# Color branches by cluster formed from the cut at a height of 20
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)
# Color branches by cluster formed from the cut at a height of 40
dend_40 <- color_branches(dend_players, h = 40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)
The height of any branch is determined by the linkage and distance decisions (in this case complete linkage and Euclidean distance). The members of any cluster that forms below a chosen height are guaranteed to have a maximum pairwise distance (under the chosen linkage) amongst themselves that is less than that height.
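To make this concrete, the hclust object stores the height of every merge, and cutree() with a height cutoff returns the implied memberships; a brief sketch, assuming the hc_players object created above:

# Heights at which successive merges happened
# (here: complete linkage on Euclidean distances)
hc_players$height

# Cluster assignments implied by cutting the tree at heights 20 and 40
cutree(hc_players, h = 20)
cutree(hc_players, h = 40)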
# Calculate Euclidean distance between customers
dist_customers <- dist(ws_customers, method = "euclidean")

# Generate a complete linkage analysis
hc_customers <- hclust(dist_customers, method = "complete")

# Plot the dendrogram
plot(hc_customers)
# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h = 15000)

# Generate the segmented customers data frame
segment_customers <- mutate(ws_customers, cluster = clust_customers)
head(segment_customers, 10)
 Milk  Grocery  Frozen  cluster
11103    12469     902        1
 2013     6550     909        2
 1897     5234     417        2
 1304     3643    3045        2
 3199     6986    1455        2
 4560     9965     934        2
  879     2060     264        2
 6243     6360     824        2
13316    20399    1809        3
 5302     9785     364        2
Explore wholesale customer clusters
Since we are working with more than two dimensions, it would be challenging to visualize a scatter plot of the clusters; instead we will rely on summary statistics to explore them. We will analyze the mean amount spent in each cluster for all three categories.
# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
cluster   n
1         5
2        29
3         5
4         6
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)

# Plot the colored dendrogram
plot(dend_colored)
# Calculate the mean for each category
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(list(mean))
cluster       Milk    Grocery     Frozen
1        16950.000  12891.400    991.200
2         2512.828   5228.931   1795.517
3        10452.200  22550.600   1354.800
4         1249.500   3916.833  10888.667
Customers in cluster 1 spent more money on Milk than any other cluster.
Customers in cluster 3 spent more money on Grocery than any other cluster.
Customers in cluster 4 spent more money on Frozen goods than any other cluster.
The majority of customers fell into cluster 2 and did not show any excessive spending in any category.
Whether these clusters are meaningful depends heavily on the business context of the clustering.
K-means clustering
We will build an understanding of the principles behind the k-means algorithm, explore how to select the right k when it isn't previously known, and revisit the wholesale data from a different perspective.
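Before using kmeans(), it can help to see the idea written out by hand. Below is a minimal, hedged sketch of the core loop (Lloyd's algorithm) on a made-up matrix pts; the real kmeans() function adds multiple random starts, convergence checks, and empty-cluster handling:

# Made-up data: 10 observations, 2 features
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)
k <- 2

# Start from k randomly chosen observations as the initial centroids
centroids <- pts[sample(nrow(pts), k), ]

for (iter in 1:10) {
  # Assign each observation to its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, pts)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)

  # Recompute each centroid as the mean of the observations assigned to it
  # (a cluster could in principle become empty; kmeans() handles that case)
  for (j in 1:k) {
    centroids[j, ] <- colMeans(pts[assignment == j, , drop = FALSE])
  }
}
assignment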
# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
K-means on a soccer field (part 2)
# Build a kmeans model
model_km3 <- kmeans(lineup, centers = 3)

# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster

# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster = clust_km3)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
Silhouette analysis allows you to calculate how similar each observation is to the cluster it is assigned to, relative to the other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows (a small sketch of how the width is computed appears after this list):
Values close to 1 suggest that the observation is well matched to the assigned cluster
Values close to 0 suggest that the observation is borderline matched between two clusters
Values close to -1 suggest that the observation may be assigned to the wrong cluster
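For one observation i, the silhouette width is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. A hedged sketch of that calculation (the silhouette() function in the cluster package does this for every observation):

# Silhouette width for a single observation i, given a dist object d
# and a vector of cluster assignments
silhouette_width_i <- function(i, d, clusters) {
  d <- as.matrix(d)
  own <- setdiff(which(clusters == clusters[i]), i)
  a <- mean(d[i, own])                                      # cohesion within own cluster
  b <- min(sapply(setdiff(unique(clusters), clusters[i]),
                  function(k) mean(d[i, clusters == k])))   # nearest other cluster
  (b - a) / max(a, b)
}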
# Generate a k-means model using the pam() function with a k = 2
pam_k2 <- pam(lineup, k = 2)

# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))
Notice that for k = 2, no observation has a silhouette width close to 0, whereas for k = 3, observation 3 has a silhouette width that is close to 0 and negative. This suggests that k = 3 is not the right number of clusters.
Estimate the “best” k using average silhouette width
Run k-means with the suggested k
Characterize the spending habits of these clusters of customers
Revisiting wholesale data: “Best” k
head(ws_customers, 10)
 Milk  Grocery  Frozen
11103    12469     902
 2013     6550     909
 1897     5234     417
 1304     3643    3045
 3199     6986    1455
 4560     9965     934
  879     2060     264
 6243     6360     824
13316    20399    1809
 5302     9785     364
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = ws_customers, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
k = 2 has the highest average silhouette width and is the "best" value of k; we will move forward with it.
Revisiting wholesale data: Exploration
From the previous analysis we have found that k = 2 has the highest average silhouette width. We will continue to analyze the wholesale customer data by building and exploring a k-means model with 2 clusters.
set.seed(42)

# Build a k-means model for ws_customers with a k of 2
model_customers <- kmeans(ws_customers, centers = 2)

# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster

# Build the segment_customers data frame
segment_customers <- mutate(ws_customers, cluster = clust_customers)

# Calculate the size of each cluster
count(segment_customers, cluster)
cluster   n
1        35
2        10
# Calculate the mean for each category
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(list(mean))
cluster       Milk  Grocery    Frozen
1         2296.257     5004  3354.343
2        13701.100    17721  1173.000
It seems that in this case cluster 1 consists of individuals who proportionally spend more on Frozen food, while cluster 2 customers spend more on Milk and Grocery. When we explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gave us 2. Both of these results are valid; deciding which one is appropriate would require more subject matter expertise. Generating clusters is a science, but interpreting them is an art.
Case Study: National Occupational mean wage
Explore how the average salary amongst professions has changed over time.
Occupational wage data
We will use data from the Occupational Employment Statistics (OES) program, which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. We would like to use it to identify clusters of occupations that maintained similar income trends.
oes <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/1e1ec9f146a25d7c71a6f6f0f46c3de7bcefd36c/oes.rds")))

# Summarize the yearly salary distributions
summary(oes)
2001 2002 2003 2004
Min. :16720 Min. :17180 Min. :17400 Min. :17620
1st Qu.:26728 1st Qu.:27393 1st Qu.:27858 1st Qu.:28535
Median :34575 Median :35205 Median :36180 Median :37335
Mean :37850 Mean :39701 Mean :41018 Mean :42275
3rd Qu.:49875 3rd Qu.:53108 3rd Qu.:55733 3rd Qu.:57443
Max. :70800 Max. :78870 Max. :83400 Max. :87090
2005 2006 2007 2008
Min. :17840 Min. :18430 Min. :19440 Min. : 20220
1st Qu.:29043 1st Qu.:29688 1st Qu.:30810 1st Qu.: 31643
Median :37790 Median :39030 Median :40235 Median : 41510
Mean :42775 Mean :44329 Mean :46074 Mean : 47763
3rd Qu.:58005 3rd Qu.:59915 3rd Qu.:62313 3rd Qu.: 64610
Max. :88450 Max. :91930 Max. :96150 Max. :100310
2010 2011 2012 2013
Min. : 21240 Min. : 21430 Min. : 21380 Min. : 21580
1st Qu.: 32863 1st Qu.: 33430 1st Qu.: 33795 1st Qu.: 34120
Median : 42995 Median : 43610 Median : 44055 Median : 44565
Mean : 49758 Mean : 50555 Mean : 51077 Mean : 51800
3rd Qu.: 67365 3rd Qu.: 68423 3rd Qu.: 69253 3rd Qu.: 70615
Max. :105440 Max. :107410 Max. :108570 Max. :110550
2014 2015 2016
Min. : 21980 Min. : 22850 Min. : 23850
1st Qu.: 34718 1st Qu.: 35425 1st Qu.: 36350
Median : 45265 Median : 46075 Median : 46945
Mean : 52643 Mean : 53785 Mean : 55117
3rd Qu.: 71825 3rd Qu.: 73155 3rd Qu.: 74535
Max. :112490 Max. :115020 Max. :118020
dim(oes)
22 15
There are no missing values, no categorical variables, and the features are all on the same scale.
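A quick hedged sketch of how those checks could be confirmed on the oes matrix:

# Any missing values?
anyNA(oes)

# All features numeric (no categorical columns)?
all(sapply(as.data.frame(oes), is.numeric))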
Next steps: hierarchical clustering
Evaluate whether pre-processing is necessary
Create a distance matrix
Build a dendrogram
Extract clusters from dendrogram
Explore resulting clusters
Hierarchical clustering: Occupation trees
The oes data is ready for hierarchical clustering without any further preprocessing. We will build a dendrogram of occupations based on their yearly average salaries and propose clusters using a height of 100,000.
# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")

# Generate an average linkage analysis
hc_oes <- hclust(dist_oes, method = "average")

# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)

# Plot the dendrogram
plot(dend_oes)
# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)

# Plot the colored dendrogram
plot(dend_colored)
Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another. Let’s continue this exploration.
Hierarchical clustering: Preparing for exploration
We have now created a potential clustering for the oes data. Before we can explore these clusters with ggplot2, we need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.
# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')

# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)

# Generate the segmented oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)

# Create a tidy data frame by gathering the year and values into two columns
gathered_oes <- gather(data = clust_oes, key = year, value = mean_salary,
                       -occupation, -cluster)
head(gathered_oes, 10)
We have successfully created all the parts necessary to explore the results of this hierarchical clustering work. We will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.
# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
Management                   1
Legal                        1
Business Operations          2
Computer Science             2
Architecture/Engineering     2
Life/Physical/Social Sci.    2
Healthcare Practitioners     2
Community Services           3
Education/Training/Library   3
Arts/Design/Entertainment    3
Healthcare Support           3
Protective Service           3
Food Preparation             3
Grounds Cleaning & Maint.    3
Personal Care                3
Sales                        3
Office Administrative        3
Farming/Fishing/Forestry     3
Construction                 3
Installation/Repair/Maint.   3
Production                   3
Transportation/Moving        3
# Plot the relationship between mean_salary and year, coloring the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
  geom_line(aes(group = occupation))
From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years. Let’s see what we can get by exploring this data using k-means.
Reviewing the HC results
Next steps: k-means clustering
Evaluate whether pre-processing is necessary
Estimate the “best” k using the elbow plot
Estimate the “best” k using the maximum average silhouette width
Explore resulting clusters
K-means: Elbow analysis
We will leverage the k-means elbow plot to propose the "best" number of clusters.
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
K-means: Average Silhouette Widths
So hierarchical clustering resulted in 3 clusters and the elbow method suggests 2. We will use average silhouette widths to explore what the "best" value of k should be.
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
It seems that this analysis results in yet another value of k; this time 7 is the top contender (although 2 comes very close).